
Conversation

Contributor

@mhamilton723 mhamilton723 commented Sep 18, 2017

  • Replace old readers with new performant Dataframe readers
  • Remove all references to DF.rdd.mapPartitions

@mhamilton723 mhamilton723 force-pushed the streaming branch 3 times, most recently from 6e55de4 to 5a0dfa8 Compare September 18, 2017 20:22
@mhamilton723 mhamilton723 changed the title Refactor Image Reader Refactor MMLSpark for Structured Streaming Sep 19, 2017
@mhamilton723 mhamilton723 force-pushed the streaming branch 6 times, most recently from f4e93cb to 20d048e Compare September 20, 2017 02:44
@mhamilton723 mhamilton723 force-pushed the streaming branch 2 times, most recently from fc00299 to b87a9e6 Compare September 20, 2017 23:52
@@ -1,29 +0,0 @@
// Copyright (C) Microsoft Corporation. All rights reserved.
Contributor

Please keep these in their original places since we'll be moving to Spark Images soon anyway.

.queryName("images")
.start()

Thread.sleep(3000)
Contributor

Is there a way to check for the 6 images being found more directly than waiting for 3 seconds? What if it sometimes took 4 seconds, or only 1 second (in which case you'd be blocking longer than needed)?

Contributor Author

done
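One way to avoid the fixed sleep is to poll for the condition with a deadline. A minimal sketch in plain Scala (the helper name `waitFor` and the timeout values are illustrative, not from this PR):

```scala
// Poll `condition` every `intervalMs` until it holds or `timeoutMs` elapses.
// Returns true if the condition became true before the deadline.
def waitFor(timeoutMs: Long, intervalMs: Long = 100)(condition: => Boolean): Boolean = {
  val deadline = System.currentTimeMillis() + timeoutMs
  while (System.currentTimeMillis() < deadline) {
    if (condition) return true
    Thread.sleep(intervalMs)
  }
  condition
}
```

In the test this would wrap a check such as counting rows in the `images` query's sink, unblocking as soon as the six images arrive instead of always sleeping for three seconds.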

// Copyright (C) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License. See LICENSE in project root for information.

package org.apache.spark.sql.execution.datasources.binary
Contributor

consider putting this in org.apache.spark.image package


inputStream = fs.open(file)
rng.setSeed(filename.hashCode.toLong)
if (inspectZip) {
Contributor Author

no nested ifs

Contributor Author

done

class HadoopFileReader(file: PartitionedFile, conf: Configuration, subsample: Double, inspectZip: Boolean)
extends Iterator[BytesWritable] with Closeable {

Logger.getRootLogger.warn("reading " + file.filePath)
Contributor Author

take out

filteredPaths.map(_.getPath) ++ filteredDirs.flatMap(p => recursePath(fileSystem, p, pathFilter))
}

def streamUnstructured(ssc: StreamingContext, directory: String): InputDStream[(String, BytesWritable)] = {
Contributor

consider removing support for non-structured stream

Contributor Author

done

Contributor

@drdarshan drdarshan left a comment

Almost there... please also add a sample notebook, since this is a pretty epic change.

@mhamilton723 mhamilton723 force-pushed the streaming branch 2 times, most recently from 234dc65 to 3883acf Compare September 22, 2017 22:20
@mmlspark-bot
Contributor

Pass! — The build has succeeded.

MMLSpark 0.8.dev1+7.ge8535c7

This is a new version build!

This is a build for github PR #134, changes:


  • Source: mhamilton723/mmlspark,
    streaming at revision e8535c7
    (by Mark Hamilton marhamil@microsoft.com).

  • Build: MMLSpark, 256419
    (built by elbarzil-vm on elbarzil-vm, 2017-09-23 16:08)

  • Info:
    0.8.dev1+7.ge8535c7: mhamilton723/mmlspark/streaming@e8535c7b; MMLSpark#256419

  • Queued by:
    Mark Hamilton for Mark Hamilton

  • Maven package uploaded, use --packages com.microsoft.ml.spark:mmlspark_2.11:0.8.dev1+7.ge8535c7 and --repositories https://mmlspark.azureedge.net/maven.

  • PIP package uploaded.

  • HDInsight: Copy the link to this Script Action to setup this build on a cluster.

  • Documentation uploaded.

@microsoft microsoft deleted a comment from mmlspark-bot Sep 24, 2017
@microsoft microsoft deleted a comment from mmlspark-bot Sep 24, 2017
@microsoft microsoft deleted a comment from mmlspark-bot Sep 24, 2017
@microsoft microsoft deleted a comment from mmlspark-bot Sep 24, 2017
@drdarshan drdarshan removed their assignment Sep 25, 2017
/**
* Thin wrapper class analogous to others in the spark ecosystem
*/
class HadoopFileReader(file: PartitionedFile, conf: Configuration, subsample: Double, inspectZip: Boolean)
Contributor Author

private class

Contributor Author

done

def isBinaryFile(df: DataFrame, col: String): Boolean =
df.schema(col).dataType == schema

def recursePath(fileSystem: FileSystem, path: Path, pathFilter: FileStatus => Boolean): Array[Path] = {
Contributor Author

check for symlinks
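Hadoop's FileStatus exposes an isSymlink check, so the recursion can skip links and avoid cycles. The same guard sketched with plain JDK file APIs, since the idea carries over directly (the helper name `recurseLocal` is illustrative):

```scala
import java.io.File
import java.nio.file.Files

// Recursively collect regular files under `f`, skipping symbolic links
// so that link cycles cannot cause infinite recursion.
def recurseLocal(f: File): Seq[File] = {
  if (Files.isSymbolicLink(f.toPath)) Seq.empty
  else if (f.isFile) Seq(f)
  else if (f.isDirectory) Option(f.listFiles()).toSeq.flatten.flatMap(recurseLocal)
  else Seq.empty
}
```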

* @param recursive Recursive search flag
* @return DataFrame with a single column of "binaryFiles", see "columnSchema" for details
*/
def read(path: String, recursive: Boolean, spark: SparkSession,
Contributor Author

make sure this works with python monkeypatch

Contributor Author

yep

case Some(row) =>
val imGenRow = new GenericInternalRow(1)
val genRow = new GenericInternalRow(ImageReader.columnSchema.fields.length)
genRow.update(0, UTF8String.fromString(row.getString(0)))
Contributor Author

comment to direct readers to image schema

* @return returns None if decompression fails
*/
private[spark] def decode(filename: String, bytes: Array[Byte]): Option[Row] = {
def decode(filename: String, bytes: Array[Byte]): Option[Row] = {
Contributor Author

put in right namespace and make private

Contributor Author

impossible

Contributor

OK!
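The decode helper's contract is to return None on failure rather than throw. A minimal sketch of that Option-on-failure pattern with javax.imageio (an illustration of the contract only, not the PR's implementation, which builds a Row in the image schema):

```scala
import java.awt.image.BufferedImage
import java.io.ByteArrayInputStream
import javax.imageio.ImageIO

// Attempt to decode raw bytes as an image. ImageIO.read returns null for
// unrecognized formats, so both that case and exceptions map to None.
def decodeImage(bytes: Array[Byte]): Option[BufferedImage] =
  try Option(ImageIO.read(new ByteArrayInputStream(bytes)))
  catch { case _: Exception => None }
```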

false
}
} else {
rng.setSeed(filename.hashCode.toLong)
Contributor Author

This RNG is used internally only, so it's not as if we are overriding a user-supplied RNG. I used this method because distributed, reproducible random splits are a very hard problem: we don't control the list of paths that we load in, Spark provides it for us. The random split also needs to be robust to the partitioning strategy, which makes a single RNG impossible, as it would depend on the ordering. Here I chose to make the RNG dependent on the filename, which is why it uses the filename as the seed. This allows for reproducibility, provided the filenames are the same. The randomness is also preserved because the seeds will all be different (provided there are no hash collisions), and when iterating through zip files the random seed is not reset every time. I realize now that the above setting of the seed seems redundant, but it is harmless, so I will remove it and rely on the seed setting in the init.

Yes, it's definitely a hack, but it's the least egregious hack I could think of and is fairly performant. I think the real way to do this might be to use the filters provided by the Catalyst optimizer, but that involves implementing an entire DSL of filters, and I would be more than happy to investigate that in a further PR.
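The filename-seeded scheme described above can be sketched as follows: each file's inclusion is decided by a Random seeded with its own name, so the decision is independent of partitioning and ordering, and repeatable across runs as long as the filenames are stable (the helper name `keepFile` is illustrative):

```scala
import scala.util.Random

// Decide whether to keep a file when taking a `subsample` fraction of the data.
// Seeding the RNG with the filename makes the decision deterministic per file,
// independent of how Spark partitions or orders the input paths.
def keepFile(filename: String, subsample: Double): Boolean = {
  val rng = new Random(filename.hashCode.toLong)
  rng.nextDouble() < subsample
}
```

The trade-off is exactly the one noted above: reproducibility holds only while filenames are stable, and distinct filenames that collide under hashCode will make identical keep/drop decisions.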

@mmlspark-bot
Contributor

Pass! — The build has succeeded.

MMLSpark 0.8.dev2+3.gccfbee2

This is a new version build!

This is a build for github PR #134, changes:


  • Source: mhamilton723/mmlspark,
    streaming at revision ccfbee2
    (by Mark Hamilton marhamil@microsoft.com).

  • Build: MMLSpark, 265844
    (built by elbarzil-vm on elbarzil-vm, 2017-09-28 18:36)

  • Info:
    0.8.dev2+3.gccfbee2: mhamilton723/mmlspark/streaming@ccfbee25; MMLSpark#265844

  • Queued by:
    Mark Hamilton for Mark Hamilton

  • Maven package uploaded, use --packages com.microsoft.ml.spark:mmlspark_2.11:0.8.dev2+3.gccfbee2 and --repositories https://mmlspark.azureedge.net/maven.

  • PIP package uploaded.

  • HDInsight: Copy the link to this Script Action to setup this build on a cluster.

  • Documentation uploaded.

@mhamilton723 mhamilton723 force-pushed the streaming branch 4 times, most recently from f28f3db to 544b32f Compare September 29, 2017 20:37
@microsoft microsoft deleted a comment from mmlspark-bot Sep 29, 2017
@mmlspark-bot
Contributor

Pass! — The build has succeeded.

MMLSpark 0.8.dev2+3.g544b32f

This is a new version build!

This is a build for github PR #134, changes:


  • Source: mhamilton723/mmlspark,
    streaming at revision 544b32f
    (by Mark Hamilton marhamil@microsoft.com).

  • Build: MMLSpark, 268665
    (built by elbarzil-vm on elbarzil-vm, 2017-09-29 20:39)

  • Info:
    0.8.dev2+3.g544b32f: mhamilton723/mmlspark/streaming@544b32f8; MMLSpark#268665

  • Queued by:
    Mark Hamilton for Mark Hamilton

  • Maven package uploaded, use --packages com.microsoft.ml.spark:mmlspark_2.11:0.8.dev2+3.g544b32f and --repositories https://mmlspark.azureedge.net/maven.

  • PIP package uploaded.

  • HDInsight: Copy the link to this Script Action to setup this build on a cluster.

  • Documentation uploaded.

drdarshan
drdarshan previously approved these changes Oct 3, 2017
Contributor

@drdarshan drdarshan left a comment

Thank you for working through all the comments!

@mmlspark-bot
Contributor

Pass! — The build has succeeded.

MMLSpark 0.8.dev6+4.g65a3635

This is a build for github PR #134, changes:


  • Source: mhamilton723/mmlspark,
    streaming at revision 65a3635
    (by Mark Hamilton marhamil@microsoft.com).

  • Build: MMLSpark, 273731
    (built by elbarzil-vm on elbarzil-vm, 2017-10-03 19:07)

  • Info:
    0.8.dev6+4.g65a3635: mhamilton723/mmlspark/streaming@65a36356; MMLSpark#273731

  • Queued by:
    Eli Barzilay for Eli Barzilay

  • Maven package uploaded, use --packages com.microsoft.ml.spark:mmlspark_2.11:0.8.dev6+4.g65a3635 and --repositories https://mmlspark.azureedge.net/maven.

  • PIP package uploaded.

  • HDInsight: Copy the link to this Script Action to setup this build on a cluster.

  • Documentation uploaded.

@elibarzilay elibarzilay merged commit 4f1077e into microsoft:master Oct 3, 2017
@mhamilton723 mhamilton723 deleted the streaming branch October 4, 2017 20:22